https://medium.com/@alice.yang_10652/find-and-highlight-text-in-pdf-with-python-213deeed1718
Find and Highlight Text in PDF with Python | by Alice Yang | Medium
Find and Highlight Text in PDF with Python
PDF (Portable Document Format) files are widely used for sharing and preserving documents with their original formatting intact. When working with lengthy PDF documents, finding specific information can be time-consuming. That’s where the Find and highlight text feature becomes invaluable. By utilizing this feature, you can quickly locate relevant information, extract important details, and create visual markers for easy reference. This article will explore how to find and highlight text in PDF using Python. It covers the following topics:
- Find and Highlight Text in PDF with Python
- Find and Highlight Text in a PDF Page Area with Python
- Find and Highlight Text in PDF using Regex with Python
- Find Text and Get Its Coordinates in PDF with Python
Python Library to Find and Highlight Text in PDF
To find and highlight text in PDF files with Python, we will use Spire.PDF for Python. It is a feature-rich and user-friendly library designed to create, read, edit, and convert PDF files within Python applications.
You can install Spire.PDF for Python from PyPI using the following pip command:
pip install Spire.Pdf
If you already have Spire.PDF for Python installed and would like to upgrade to the latest version, use the following pip command:
pip install --upgrade Spire.Pdf
For more detailed information about the installation, you can check this official documentation: How to Install Spire.PDF for Python in VS Code.
Find and Highlight Text in PDF with Python
The PdfTextFinder class in Spire.PDF for Python is used to search for text in PDF documents. Using the Find() method of this class, you can find a specific word or sentence on PDF pages. Then you can highlight each found instance of the text with a bright color, and get the number of instances and the corresponding page numbers.
Here are the steps to find and highlight text in a PDF document with Python:
- Create an instance of the PdfDocument class and load the PDF document using the PdfDocument.LoadFromFile() method.
- Initialize a counter to track the number of text instances and a list to store the page numbers where the text appears.
- Iterate through the pages in the PDF.
- For each page, create a PdfTextFinder instance and set the text finding parameters (such as WholeWord, IgnoreCase) through the PdfTextFinder.Options.Parameter property.
- Use the PdfTextFinder.Find() method to search for the specific text on the page. This method will return a list of PdfTextFragment objects, each representing an instance of the text in the document.
- Iterate through the PdfTextFragment objects in the list. Then use PdfTextFragment.Highlight() method to highlight each instance, increment the count of text instances and add the current page number to the list.
- Use the PdfDocument.SaveToFile() method to save the resulting document to a new file.
- Print the number of text instances and the page numbers.
Here is a code example of how to find and highlight text in a PDF with Python:
from spire.pdf.common import *
from spire.pdf import *
# Create an object of the PdfDocument class
doc = PdfDocument()
# Load a PDF file
doc.LoadFromFile("Adobe Acrobat.pdf")
# Initialize a counter to keep track of the number of instances
instance_count = 0
# Initialize a list to store the page numbers
page_numbers = []
# Iterate through the pages in the document
for i in range(doc.Pages.Count):
page = doc.Pages[i]
# Create a PdfTextFinder instance
finder = PdfTextFinder(page)
# Set the text finding parameter
finder.Options.Parameter = TextFindParameter.WholeWord
# Find a specific text
results = finder.Find("Adobe Acrobat")
# Highlight all instances of the specific text
for text in results:
text.HighLight(Color.get_Yellow())
# Increment the instance count
instance_count += 1
# Add the page number to the list
page_numbers.append(i+1)
# Save the result file
doc.SaveToFile("FindAndHighlightText.pdf")
# Print the number of instances and the page numbers
print(f"The text 'Adobe Acrobat' appears {instance_count} times in the PDF.")
print(f"The text appears on the following pages: {', '.join(map(str, page_numbers))}")
Find and Highlight Text in a PDF Page Area with Python
In some cases, you may need to find and highlight text within a specific area or region of a PDF page, rather than the entire page. Using the PdfTextFinder.Options.Area property, you can easily define the page area to search for text.
Here are the steps to find and highlight text in a specific PDF page area with Python:
- Create an instance of the PdfDocument class and use the PdfDocument.LoadFromFile() method to load the PDF document.
- Iterate through the pages in the PDF.
- For each page, create a PdfTextFinder instance and set the page area to search for text through the PdfTextFinder.Options.Area property.
- Use the PdfTextFinder.Find() method to search for the specific text in the page area.
- Highlight each found instance using the PdfTextFragment.Highlight() method.
- Save the resulting document using PdfDocument.SaveToFile() method.
Here is a code example of how to find and highlight text in a specific PDF page area with Python:
from spire.pdf.common import *
from spire.pdf import *
# Create an object of the PdfDocument class
doc = PdfDocument()
# Load a PDF file
doc.LoadFromFile("Adobe Acrobat.pdf")
# Iterate through the pages in the document
for i in range(doc.Pages.Count):
page = doc.Pages[i]
# Create a PdfTextFinder instance
finder = PdfTextFinder(page)
# Set the page area to search for text
finder.Options.Area = RectangleF(0.0, 0.0, 300.0, 300.0)
# Find a specific text
results = finder.Find("Adobe Acrobat")
# Highlight all instances of the specific text
for text in results:
text.HighLight(Color.get_Yellow())
# Save the resulting file
doc.SaveToFile("FindAndHighlightTextInPageArea.pdf")
doc.Close()
Find and Highlight Text in PDF using Regex with Python
Regular expressions (regex) are a powerful tool for performing sophisticated text finds, allowing you to precisely match and extract information based on complex patterns and rules.
To search and highlight text using regular expression in a PDF, you first need to set the PdfTextFinder.Options.Parameter property to TextFindParameter.Regex to enable regex-based searching. Then, pass the regular expression as a parameter to the Find() method to implement text searching based on regular expression.
Here are the steps to find and highlight text in a PDF using regular expression with Python:
- Create an instance of the PdfDocument class and use the PdfDocument.LoadFromFile() method to load the PDF document.
- Iterate through the pages in the PDF.
- For each page, create a PdfTextFinder instance and set the PdfTextFinder.Options.Parameter property to TextFindParameter.Regex to enable regular expression-based searching.
- Pass the regular expression to the PdfTextFinder.Find() method to implement text searching based on regular expressions.
- Use the PdfTextFragment.Highlight() method to highlight each matched instance.
- Save the resulting document using the PdfDocument.SaveToFile() method.
Here is a code example of how to find and highlight text in a PDF using regular expression in Python:
from spire.pdf.common import *
from spire.pdf import *
# Create an object of the PdfDocument class
doc = PdfDocument()
# Load a PDF file
doc.LoadFromFile("Template.pdf")
# Iterate through the pages in the document
for i in range(doc.Pages.Count):
page = doc.Pages[i]
# Create a PdfTextFinder instance
finder = PdfTextFinder(page)
# Set the text finding parameter to enable regex-based searching
finder.Options.Parameter = TextFindParameter.Regex
# Find the text starting with the symbol "#"
results = finder.Find("""\\#\\w+\\b""")
# Highlight all matched text
for text in results:
text.HighLight(Color.get_Yellow())
# Save the resulting document
doc.SaveToFile("FindAndHighlightTextUsingRegex.pdf")
doc.Close()
Find Text and Get Its Coordinates in PDF with Python
You can find specific text in a PDF and retrieve the coordinates of each found text instance through the PdfTextFragment.Positions[].X and PdfTextFragment.Positions[].Y properties.
Here are the steps to find text in a PDF and retrieve the coordinates of each found instance with Python:
- Create an instance of the PdfDocument class and use the PdfDocument.LoadFromFile() method to load the PDF document.
- Iterate through the pages in the PDF.
- For each page, create a PdfTextFinder instance.
- Use the PdfTextFinder.Find() method to search for the specific text.
- Use the PdfTextFragment.Positions[0].X and PdfTextFragment.Positions[0].Y properties to get the X and Y coordinates of each found instance.
Here is a code example of how to find text in a PDF and retrieve the coordinates of each found instance with Python:
from spire.pdf.common import *
from spire.pdf import *
# Create an object of the PdfDocument class
doc = PdfDocument()
# Load a PDF file
doc.LoadFromFile("Adobe Acrobat.pdf")
# Iterate through the pages in the document
for i in range(doc.Pages.Count):
page = doc.Pages[i]
# Create a PdfTextFinder instance
finder = PdfTextFinder(page)
# Find a specific text
results = finder.Find("Adobe Acrobat")
# Print the coordinates of each found instance
for text in results:
print(f"Text Position: ({text.Positions[0].X}, {text.Positions[0].Y})")
doc.Close()
Conclusion
This blog demonstrated various scenarios for searching and highlighting text in PDF using Python. Additionally, it also explained how to get the coordinates of specific text in PDF using Python. We hope you find it helpful.